High-dimensional analysis with pytometry¶
Table of contents¶
What is pytometry?¶
Pytometry is a Python package that provides tools for reading in flow cytometry data and applying pre-processing steps. While it provides methods for compensation and transformation (just like flowkit), its focus is on the high-dimensional and unsupervised analysis of cytometry data. Pytometry uses a data structure called AnnData ("Annotated Data") that is commonly used for single cell analysis, most notably by the scanpy package often used for RNAseq analysis. This allows us to apply algorithms and visualisations from the transcriptomics world to our cytometry data. The AnnData structure may seem a little intimidating at first, but we'll learn as we go, and hopefully you'll see it's just a collection of matrices/DataFrames holding different kinds of data, all linked back to the expression matrix.
In the diagram below (from the AnnData package's documentation):
- X is the expression matrix where each row is an event and each column is a variable
- var is a DataFrame of metadata about each variable in X
- obs is a DataFrame of metadata about each observation (event) in X
- uns is a dictionary of unstructured metadata that doesn't fit anywhere else
There are some other subfields in the diagram that either aren't relevant for now, or that we'll come back to later.
Note: pytometry assumes your data have already been cleaned of debris, dead cells, and doublets. Most people will find it easiest to do this using software like FlowJo and exporting cleaned .fcs or .csv files. Alternatively this can be done in flowkit as in the previous tutorial.
We start by importing the packages we're going to need. There are quite a lot of them in this tutorial so you may wish to take note of the alias we use for each. Don't forget that before using these packages you will need to install them, for example using pip install bokeh pytometry in a terminal/powershell.
import scanpy as sc # for single cell analysis
import anndata as ann # for annotated data structure
import cytonormpy as cnp # for correcting batch effects
import numpy as np # for numpy arrays
import pandas as pd # for pandas dataframes
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for more plotting functions
import joypy # for making ridgeplots
import pytometry as pm # for high-dim cytometry functions
import os # for working with directories
from sklearn.decomposition import PCA # for principal component analysis
import scipy # for stats functions
Reading in data¶
For this tutorial, you should have the 16 example .fcs files in the folder data/clean from your working/project directory. You can check this using the os.listdir() function as below. You should see the same output as shown here.
path = 'data/clean/'
files = os.listdir(path)
files
['Mock_01_A.fcs', 'Mock_02_A.fcs', 'Mock_03_A.fcs', 'Mock_04_A.fcs', 'Mock_05_B.fcs', 'Mock_06_B.fcs', 'Mock_07_B.fcs', 'Mock_08_B.fcs', 'sample_details.csv', 'Virus_01_A.fcs', 'Virus_02_A.fcs', 'Virus_03_A.fcs', 'Virus_04_A.fcs', 'Virus_05_B.fcs', 'Virus_06_B.fcs', 'Virus_07_B.fcs', 'Virus_08_B.fcs']
We'll make use of that .csv file later, but for now let's get a list containing just the .fcs file names.
files = [fileID for fileID in files if fileID.endswith(".fcs")]
files
['Mock_01_A.fcs', 'Mock_02_A.fcs', 'Mock_03_A.fcs', 'Mock_04_A.fcs', 'Mock_05_B.fcs', 'Mock_06_B.fcs', 'Mock_07_B.fcs', 'Mock_08_B.fcs', 'Virus_01_A.fcs', 'Virus_02_A.fcs', 'Virus_03_A.fcs', 'Virus_04_A.fcs', 'Virus_05_B.fcs', 'Virus_06_B.fcs', 'Virus_07_B.fcs', 'Virus_08_B.fcs']
The next step is to read these .fcs files into a list of AnnData objects. First we create an empty list that will hold the data, then we iterate over each filename, reading in the data with the pm.io.read_fcs() function, removing unwanted parameters with the pm.pp.split_signal() function, then adding that data to the list using the append() method. If you call ?pm.pp.split_signal you'll see this function removes any non-area parameters if data_type = 'facs' and removes any non-element parameters if data_type = 'cytof'.
adatas = []
for fileID in files:
adata = pm.io.read_fcs(path + fileID)
pm.pp.split_signal(
adata,
var_key = "channel",
data_type = "facs"
)
adatas.append(adata)
We can subset adatas just like any list. If we look at just the first element, we're told it's an AnnData object with 10,000 events and 8 parameters.
adatas[0]
AnnData object with n_obs × n_vars = 10000 × 8
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
uns: 'meta'
We're also told this object has a DataFrame of variable-level metadata (var), and some unstructured metadata (uns). We can access the variable-level metadata by calling the var attribute, and we see we get a DataFrame with some information about each parameter in the original .fcs file. The unstructured metadata just contains the original .fcs file keyword data, and can be accessed using the .uns attribute.
adatas[0].var
| | n | channel | marker | $PnB | $PnE | $PnG | $PnR | signal_type |
|---|---|---|---|---|---|---|---|---|
| CD3e | 1 | PECy5-5-A | CD3e | 32 | 0,0 | 1.0 | 262144 | area |
| CD16-32 | 2 | PECy7-A | CD16-32 | 32 | 0,0 | 1.0 | 262144 | area |
| Ly6G | 3 | DL800-A | Ly6G | 32 | 0,0 | 1.0 | 262144 | area |
| CD45 | 4 | AF700-A | CD45 | 32 | 0,0 | 1.0 | 262144 | area |
| CD48 | 5 | APCCy7-A | CD48 | 32 | 0,0 | 1.0 | 262144 | area |
| CD11b | 6 | BUV395-A | CD11b | 32 | 0,0 | 1.0 | 262144 | area |
| B220 | 7 | BUV737-A | B220 | 32 | 0,0 | 1.0 | 262144 | area |
| Ly6C | 8 | BV605-A | Ly6C | 32 | 0,0 | 1.0 | 262144 | area |
adatas[0].uns
OrderedDict([('meta',
{'__header__': {'FCS format': 'FCS3.1',
'text start': 256,
'text end': 909,
'data start': 910,
'data end': 320909,
'analysis start': 0,
'analysis end': 0},
'$BEGINANALYSIS': '0',
'$BEGINDATA': '910',
'$BEGINSTEXT': '0',
'$BYTEORD': '1,2,3,4',
'$DATATYPE': 'F',
'$ENDANALYSIS': '0',
'$ENDDATA': '320909',
'$ENDSTEXT': '0',
'$MODE': 'L',
'$NEXTDATA': 0,
'$PAR': 8,
'$TOT': 10000,
'channels': $PnN $PnS $PnB $PnE $PnG $PnR
n
1 PECy5-5-A CD3e 32 0,0 1.0 262144
2 PECy7-A CD16-32 32 0,0 1.0 262144
3 DL800-A Ly6G 32 0,0 1.0 262144
4 AF700-A CD45 32 0,0 1.0 262144
5 APCCy7-A CD48 32 0,0 1.0 262144
6 BUV395-A CD11b 32 0,0 1.0 262144
7 BUV737-A B220 32 0,0 1.0 262144
8 BV605-A Ly6C 32 0,0 1.0 262144,
'header': {'FCS format': 'FCS3.1',
'text start': 256,
'text end': 909,
'data start': 910,
'data end': 320909,
'analysis start': 0,
'analysis end': 0}})])
At present our data are stored as a list of AnnDatas, but it will be more convenient to merge these into a single master AnnData. We can achieve this using anndata's concat() function. The label argument defines the name of the new column in the observation-level metadata (obs) that will indicate the filename each event belongs to, and the keys argument defines what these values will be. The merge = 'same' argument tells the function to keep only the var and uns metadata that is identical across all samples. This ensures that all samples have the same parameters, but results in the uns metadata being dropped as it differs between samples (we don't mind that).
adata_all = ann.concat(
adatas,
label = 'filename',
keys = files,
merge = 'same'
)
adata_all
c:\Python_3.10.2\lib\site-packages\anndata\_core\anndata.py:1818: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 160000 × 8
obs: 'filename'
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
Let's take a look at this observation-level metadata I just mentioned. We can view it by calling the obs attribute of our object. At present we just have the filename column we created to indicate the file each event belongs to. Later we will add more metadata like experimental group and batch.
adata_all.obs
| | filename |
|---|---|
| 0 | Mock_01_A.fcs |
| 1 | Mock_01_A.fcs |
| 2 | Mock_01_A.fcs |
| 3 | Mock_01_A.fcs |
| 4 | Mock_01_A.fcs |
| ... | ... |
| 9995 | Virus_08_B.fcs |
| 9996 | Virus_08_B.fcs |
| 9997 | Virus_08_B.fcs |
| 9998 | Virus_08_B.fcs |
| 9999 | Virus_08_B.fcs |
160000 rows × 1 columns
As a quick sense-check, let's see how many events we have for each value of the filename column in obs:
adata_all.obs['filename'].value_counts()
Mock_01_A.fcs     10000
Mock_02_A.fcs     10000
Mock_03_A.fcs     10000
Mock_04_A.fcs     10000
Mock_05_B.fcs     10000
Mock_06_B.fcs     10000
Mock_07_B.fcs     10000
Mock_08_B.fcs     10000
Virus_01_A.fcs    10000
Virus_02_A.fcs    10000
Virus_03_A.fcs    10000
Virus_04_A.fcs    10000
Virus_05_B.fcs    10000
Virus_06_B.fcs    10000
Virus_07_B.fcs    10000
Virus_08_B.fcs    10000
Name: filename, dtype: int64
We can see we have 10,000 events per file. We won't always have the same number of events per sample, but it's good to check we have what we expect, and to ensure we don't have files with too few events.
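If you have many files, you can automate this check. Here is a sketch using stand-in counts (in the real workflow you would pass adata_all.obs['filename'].value_counts() instead; the 1,000-event threshold is an arbitrary choice for illustration):

```python
import pandas as pd

# Hypothetical per-file event counts, as value_counts() would return
counts = pd.Series({
    'Mock_01_A.fcs': 10000,
    'Mock_02_A.fcs': 10000,
    'Virus_01_A.fcs': 10000,
})

min_events = 1000  # arbitrary threshold for illustration
too_small = counts[counts < min_events]

if too_small.empty:
    print(f'All files have at least {min_events} events')
else:
    print('Files with too few events:', too_small.index.to_list())
```

Files flagged this way might be excluded before clustering, since very small samples contribute unstable cluster proportions.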
Transforming channel values¶
If your data were exported from FlowJo, then all the parameters will be linear. In order for our dimension reduction and clustering algorithms to provide meaningful insights, we need to transform our fluorescence/metal parameters. Pytometry provides a few common transformations such as biexponential, logicle, and asinh. The asinh (inverse hyperbolic sine) transformation is almost exclusively used for mass cytometry data, while any of these transformations may be applied to fluorescence cytometry data. You can experiment with different transformations, but for today we will use biexponential.
We apply the biexponential transformation using the normalize_biexp() function, giving the AnnData object and a series of parameters for the transformation. You can experiment with these parameters, but I took these particular values directly from FlowJo. There is also a normalize_autologicle() function for automatically choosing the parameters of a logicle transformation, but I'll let you experiment with that one. These functions modify the values of the existing AnnData object in place.
pm.tl.normalize_biexp(
adata_all,
negative = 0,
width = -100,
positive = 3.71,
max_value = 65536
)
#pm.tl.normalize_autologicle(adata_all)
To check our choice of transformation is suitable, it’s a good idea to plot each transformed parameter. A quick way to check whether our negative populations are squeezed or split by the transformation is to create ridge plots (sometimes called joyplots). We do this using the joyplot() function from the package of the same name. This function doesn't accept AnnData objects as input, so we convert the expression data to a DataFrame using to_df() then randomly sample 10,000 events to speed the plotting up. To the column argument we give a list of variable names we access from the var_names attribute (converting it to a list as is expected by joyplot()), where every variable will be plotted as a separate distribution. The overlap argument just controls how much the distributions overlap and you can experiment with this.
p = joypy.joyplot(
adata_all.to_df().sample(10_000),
column = adata_all.var_names.to_list(),
overlap = 1.5
)
Those negative populations look ok, but looking at bivariate plots is a good idea too. Below, we use seaborn's histplot() function for plotting a 2D histogram of CD45 against every other parameter. We extract the expression data as a numpy array by calling the X (capital X) attribute from our AnnData object (or we could have extracted it as a DataFrame as we did above). We iterate over the parameters, subsetting for the column for the current parameter.
for idx, name in enumerate(adata_all.var_names):
p = sns.histplot(
x = adata_all.X[:, idx],
y = adata_all.X[:, 3],
bins = 200,
cmap = 'viridis'
)
plt.xlabel(name)
plt.ylabel('CD45')
plt.show()
After inspecting each of the plots, we may wish to go back and change our transformation. This transformation looks fine, so we’ll continue.
Adding sample metadata¶
Often, we will have sample metadata we wish to include in our analysis that couldn’t be read in as part of our .fcs files. This might include sample identifiers and grouping variables such as treatment, concentration, and time. It’s easy to add this information to our AnnData object by manually creating a .csv file with the relevant information, and merging this with the obs DataFrame.
In your data/clean/ directory, you should have the file "sample_details.csv" that contains the metadata for this example. If you open the file as a spreadsheet, you'll see it contains the columns:
- filename
- sample
- group
- batch_label
- batch
- reference
It doesn’t need all of these, and could contain more, depending on what variables you want to use to annotate your samples (you could include donor age, for example). However note that (at the time of writing this) the cytonorm algorithm does expect the column batch to contain only integers and the column reference to contain the value 'ref' for reference controls, and 'other' for non-reference controls (more on all of this later).
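Because those expectations are easy to violate when building the .csv by hand, a quick defensive check along these lines can catch problems early (a sketch; the column names follow the file described above, and the stand-in DataFrame replaces reading the real .csv):

```python
import pandas as pd

# A small stand-in for sample_details.csv
metadata = pd.DataFrame({
    'filename':  ['Mock_01_A.fcs', 'Mock_05_B.fcs'],
    'batch':     [1, 2],
    'reference': ['ref', 'other'],
})

# batch should contain only integers
assert pd.api.types.is_integer_dtype(metadata['batch']), \
    "batch column must be integer"

# reference should contain only 'ref' or 'other'
assert set(metadata['reference']) <= {'ref', 'other'}, \
    "reference column has unexpected labels"

print('metadata looks valid')
```

Running checks like these before calling cytonorm is cheaper than debugging a failure partway through normalisation.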
We read this .csv file in as a DataFrame and merge it with the obs DataFrame, matching the metadata between the files using the filename column. You can think of it as taking each row in obs in turn, looking at the filename, then grabbing the metadata corresponding to that filename in metadata.
metadata = pd.read_csv('data/clean/sample_details.csv')
metadata
| | filename | sample | group | batch_label | batch | reference |
|---|---|---|---|---|---|---|
| 0 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref |
| 1 | Mock_02_A.fcs | Mock_02_A | Mock | A | 1 | other |
| 2 | Mock_03_A.fcs | Mock_03_A | Mock | A | 1 | other |
| 3 | Mock_04_A.fcs | Mock_04_A | Mock | A | 1 | other |
| 4 | Virus_01_A.fcs | WNV_01_A | Virus | A | 1 | other |
| 5 | Virus_02_A.fcs | WNV_02_A | Virus | A | 1 | other |
| 6 | Virus_03_A.fcs | WNV_03_A | Virus | A | 1 | other |
| 7 | Virus_04_A.fcs | WNV_04_A | Virus | A | 1 | other |
| 8 | Mock_05_B.fcs | Mock_05_B | Mock | B | 2 | ref |
| 9 | Mock_06_B.fcs | Mock_06_B | Mock | B | 2 | other |
| 10 | Mock_07_B.fcs | Mock_07_B | Mock | B | 2 | other |
| 11 | Mock_08_B.fcs | Mock_08_B | Mock | B | 2 | other |
| 12 | Virus_05_B.fcs | WNV_05_B | Virus | B | 2 | other |
| 13 | Virus_06_B.fcs | WNV_06_B | Virus | B | 2 | other |
| 14 | Virus_07_B.fcs | WNV_07_B | Virus | B | 2 | other |
| 15 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other |
adata_all.obs = pd.merge(adata_all.obs, metadata, on = 'filename', how = 'left')
c:\Python_3.10.2\lib\site-packages\anndata\_core\anndata.py:767: UserWarning:
AnnData expects .obs.index to contain strings, but got values like:
[0, 1, 2, 3, 4]
Inferred to be: integer
value_idx = self._prep_dim_index(value.index, attr)
We get a little warning from AnnData telling us it likes the index (the row names) of obs to be strings, but the merge() function returned integers. We fix that easily below and look at our newly added metadata.
adata_all.obs.index = adata_all.obs.index.astype(str)
adata_all.obs
| | filename | sample | group | batch_label | batch | reference |
|---|---|---|---|---|---|---|
| 0 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref |
| 1 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref |
| 2 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref |
| 3 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref |
| 4 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref |
| ... | ... | ... | ... | ... | ... | ... |
| 159995 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other |
| 159996 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other |
| 159997 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other |
| 159998 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other |
| 159999 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other |
160000 rows × 6 columns
Batch alignment¶
If samples were not all stained and collected together, it’s possible that batch effects exist in the data that add noise. Preventing batch effects before they occur is the best strategy, but if samples must be collected at distant time points, we can mitigate their impact by including a common sample in each round of staining and acquisition, and using the CytoNorm algorithm to align the batches based on these batch control files.
Let's start by looking for evidence of a batch effect between our two batches. A fairly effective way of doing this is to find a lower-dimensional representation of the data, and visualize it to see if batches are overlapping. We'll use the UMAP dimension reduction algorithm for this.
To speed things up, we start by taking a random sample of 50,000 events from our AnnData object and storing this in a new AnnData object called adata_sub. Before running UMAP, we must calculate the distance between each event and its nearest neighbors using sc.pp.neighbors(). We then calculate the UMAP embedding with sc.tl.umap(), where min_dist controls the granularity of the final embedding. Finally, we can plot the embedding using sc.pl.umap(), giving a list of variables from obs we wish to colour by (each resulting in a subplot).
adata_sub = sc.pp.subsample(adata_all, n_obs = 50_000, copy = True)
sc.pp.neighbors(adata_sub)
sc.tl.umap(adata_sub, min_dist = 0.1)
sc.pl.umap(adata_sub, color = ['batch_label', 'group', 'sample'], size = 3, ncols = 2)
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
The most salient plot at the moment is the one showing batch_label, and we can see that the two batches do not overlap (remember these data are from the same biological sample and should be near identical). Let's proceed with performing batch correction to reduce the impact of this batch effect.
Preparing the batch control data¶
We start by creating a CytoNorm object, and adding a FlowSOM clusterer that will partition the data into 5 clusters. The cytonorm algorithm relies on first clustering the data so that each cluster can be aligned separately, and we only need to capture the broad cell types for now.
cn = cnp.CytoNorm()
fs = cnp.FlowSOM(n_clusters = 5)
cn.add_clusterer(fs)
It's useful at this point to introduce AnnData's layers functionality. As we work with an AnnData object we may make changes to the data in X, for example by transforming or normalising it, and it can be helpful to retain multiple versions of the data rather than keep overwriting it. We can do this by creating layers within our AnnData object. We'll do this now so we can have separate layers for the original and normalised expression data.
To add a new layer, we use the layers attribute and use square brackets as if we're adding a new column to a DataFrame. In the example below, I just copy the existing values of X. When we call adata_all, we can see we now have a layer called 'original'.
adata_all.layers['original'] = adata_all.X.copy()
adata_all
AnnData object with n_obs × n_vars = 160000 × 8
obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference'
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
layers: 'original'
Next, we run the run_anndata_setup() method, which lets us define which layer we're going to use as the input data, the name of the layer that will contain the normalised data, and the columns in obs that indicate the batch and sample of each file and whether it is a reference control or not.
cn.run_anndata_setup(
adata_all,
layer = 'original',
reference_column = 'reference',
batch_column = 'batch',
sample_identifier_column = 'sample',
key_added = 'normalised'
)
Looking at adata_all again shows an additional layer that will eventually contain the normalised data (it is just a copy of original at present).
adata_all
AnnData object with n_obs × n_vars = 160000 × 8
obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference'
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
layers: 'original', 'normalised'
Now we can run the clustering we defined earlier.
cn.run_clustering()
Training the model and aligning the data¶
We then calculate quantiles per marker, per cluster, and calculate splines that map each of these distributions to the mean distributions of the batch control files. The normalize_data() method then aligns the data based on the trained model.
cn.calculate_quantiles()
cn.calculate_splines(goal = "batch_mean")
cn.normalize_data()
normalized file Mock_07_B
normalized file Mock_04_A
normalized file Mock_06_B
normalized file WNV_02_A
normalized file Mock_03_A
normalized file Mock_02_A
normalized file WNV_01_A
normalized file Mock_08_B
normalized file WNV_03_A
normalized file WNV_05_B
normalized file WNV_04_A
normalized file WNV_06_B
normalized file WNV_07_B
normalized file WNV_08_B
If we want to normalise any files not included in the initial training, or normalise the batch control files themselves, we need to give a list of file names and a corresponding list of batch memberships to the normalize_data() method.
cn.normalize_data(file_names = ['Mock_01_A', 'Mock_05_B'], batches = [1, 2])
normalized file Mock_01_A normalized file Mock_05_B
Visualising the alignment¶
It's a good idea to check if we're happy cytonorm's clustering has captured the broad cell types, and that the batch normalisation has been effective.
First, we access the _clustering attribute of our cn object and apply its calculate_clusters() method to the expression matrix to get an array of event cluster memberships. We then convert the cluster memberships to strings (to improve plotting) and store them as a column in obs.
cn_clusters = cn._clustering.calculate_clusters(adata_all.layers['original'])
cn_clusters
array([4, 4, 3, ..., 1, 1, 3], dtype=int64)
adata_all.obs['cytonorm_cluster'] = cn_clusters.astype('str')
adata_all.obs
| | filename | sample | group | batch_label | batch | reference | cytonorm_cluster |
|---|---|---|---|---|---|---|---|
| 0 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 4 |
| 1 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 4 |
| 2 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 3 |
| 3 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 3 |
| 4 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 159995 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 3 |
| 159996 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 2 |
| 159997 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 1 |
| 159998 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 1 |
| 159999 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 3 |
160000 rows × 7 columns
We're going to calculate another UMAP embedding of the data to visualise the impact of batch normalisation, as well as to colour by membership of cytonorm's clusters, to decide whether we think they're capturing the broad cell types present.
The sc.pp.neighbors() function we use to calculate the nearest neighbour distances doesn't take a layer as input, but it can take a matrix from our AnnData's obsm (observation matrices) subfield. This subfield is for storing observation-level data that may have many dimensions. Because we want to use the normalised layer to compute the UMAP, we simply copy it to an obsm slot.
adata_all.obsm['normalised'] = adata_all.layers['normalised']
adata_all.obsm
AxisArrays with keys: normalised
Now we take another subsample, calculate the nearest neighbour distances, calculate the UMAP embedding, and plot the data, colouring by cytonorm_cluster, batch_label, and group.
adata_sub = sc.pp.subsample(adata_all, n_obs = 50_000, copy = True)
sc.pp.neighbors(adata_sub, use_rep = 'normalised')
sc.tl.umap(adata_sub, min_dist = 0.1)
sc.pl.umap(
adata_sub,
layer = 'normalised',
color = ['cytonorm_cluster', 'batch_label', 'group'],
size = 3,
ncols = 1
)
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
We can see the batches are now much more superimposed on each other. The cytonorm clusters do an OK job of capturing the broad cell types. But while looking at a UMAP gives a quick overview, it's a good idea to compare the parameter distributions pre- and post-normalisation for the batch control files. First, let's create a new AnnData object containing just the data from the batch control samples.
adata_ref = adata_all[adata_all.obs['reference'] == 'ref']
adata_ref
View of AnnData object with n_obs × n_vars = 20000 × 8
obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference', 'cytonorm_cluster'
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
obsm: 'normalised'
layers: 'original', 'normalised'
Next, we create separate DataFrames containing the original and normalised expression data. We add columns indicating whether the data are original or normalised, and the batch number for each event. Finally, we concatenate these together into a single DataFrame called both.
orig = adata_ref.to_df(layer = 'original')
orig['group'] = 'original'
orig['batch'] = adata_ref.obs.batch
norm = adata_ref.to_df(layer = 'normalised')
norm['group'] = 'normalised'
norm['batch'] = adata_ref.obs.batch
both = pd.concat([orig, norm])
both
| | CD3e | CD16-32 | Ly6G | CD45 | CD48 | CD11b | B220 | Ly6C | group | batch |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1141.502801 | 1884.283173 | 631.694678 | 2096.491398 | 2345.868626 | 1541.958951 | 3431.587756 | 1197.512280 | original | 1 |
| 1 | 1146.759063 | 1001.825654 | 911.361470 | 1859.862764 | 1385.111571 | 1323.273261 | 3219.384459 | 1711.602316 | original | 1 |
| 2 | 1147.575687 | 2237.875240 | 844.361401 | 2068.902565 | 1974.999994 | 2711.006654 | 1238.627796 | 2597.616532 | original | 1 |
| 3 | 1329.867442 | 2329.289555 | 791.594765 | 1443.450031 | 2116.643027 | 3098.270045 | 1792.858119 | 2867.574559 | original | 1 |
| 4 | 1043.718253 | 2040.863053 | 342.438555 | 1501.455922 | 2602.254838 | 2053.469957 | 2690.738776 | 1301.960150 | original | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 1093.597985 | 1253.195038 | -48.364905 | 2676.159933 | 2916.001651 | 1291.851753 | 3187.865781 | 1560.876894 | normalised | 2 |
| 49996 | 1198.776982 | 2779.748414 | 3323.263529 | 2151.755543 | 1519.962398 | 3228.117012 | 1025.970601 | 2422.467398 | normalised | 2 |
| 49997 | 1473.892663 | 2433.838854 | 1060.009231 | 1707.809461 | 1881.749736 | 2775.319616 | 1575.442517 | 2821.307012 | normalised | 2 |
| 49998 | 1307.985890 | 2601.736766 | 2882.050968 | 2027.890835 | 929.216557 | 3308.465222 | 1435.532210 | 2285.653496 | normalised | 2 |
| 49999 | 1088.871324 | 2792.563676 | 3106.601275 | 1940.889623 | 1197.101574 | 3130.648391 | 1129.639749 | 2035.476600 | normalised | 2 |
40000 rows × 10 columns
Now that we have this DataFrame of annotated batch control data, we use the joyplot() function from the package of the same name to visualise the parameter distributions. As you scan through, you'll notice some discrepancies between the distributions pre-normalisation that are corrected by cytonorm.
for param in adata_all.var_names.to_list():
p = joypy.joyplot(
both,
title = param,
by = ['group', 'batch'],
column = param,
overlap = 1.5,
grid = True,
color = ['#ABDDA4', '#ABDDA4', '#FDAE61', '#FDAE61']
)
If we weren't happy with the normalisation, we could go back and change the number of clusters, for example. But this looks as though the batch effect has been removed, so now let's replace the X expression matrix in our AnnData object with the normalised expression matrix.
adata_all.X = adata_all.layers['normalised']
Clustering¶
To help us identify clusters or populations of cells in our data, we can employ a clustering algorithm to partition the events into groups that are more similar to each other than they are to events in other groups. Traditional gating is just "clustering by hand", but can be subjective and miss important populations when the parameter space is large.
The pytometry package currently only has an implementation of the FlowSOM clustering algorithm (which is probably the most widely used), but there are Python implementations of other clustering algorithms you could apply, such as phenograph. FlowSOM and phenograph are state of the art and perform well for most cytometry clustering problems. In this example we'll use FlowSOM.
Performing FlowSOM clustering is easy: we simply provide the AnnData object as the first argument, the name of the column that will indicate cluster membership in obs, and arguments that control the clustering algorithm.
The som_dim argument controls the dimensions of the self-organising map. The greater the dimensions of the SOM, the more initial clusters the data are broken into. The default is (10, 10), but here I set it to (15, 15) as it seems to result in better clustering. We can set the random seed to ensure reproducibility (pick your favourite number or leave the default of 42), and set the min_clusters and max_clusters arguments. If these have the same value, we are essentially instructing FlowSOM to return exactly that number of metaclusters. Otherwise, we can give the algorithm different minimum and maximum numbers and let it choose for us (it tends to under-cluster). It is usually best to set the number of metaclusters a little higher than the number of populations you think may be identifiable in the data (as we can easily merge clusters). We can alter the number of metaclusters later if we believe we are under- or over-clustering.
pm.tl.flowsom_clustering(
adata_all,
key_added = 'fsom_metacluster',
som_dim = (15, 15),
seed = 24601,
min_clusters = 15,
max_clusters = 15,
verbose = True
)
[ 500 / 500 ] 100% - 0:00:00 left quantization error: 650.0046638720019
Computing consensus matrices: 0%| | 0/1 [00:00<?, ?it/s]
c:\Python_3.10.2\lib\site-packages\kneed\knee_locator.py:225: RuntimeWarning: invalid value encountered in divide
  return (a - min(a)) / (max(a) - min(a))
c:\Python_3.10.2\lib\site-packages\kneed\knee_locator.py:208: RuntimeWarning: Mean of empty slice.
  self.S * np.abs(np.diff(self.x_normalized).mean())
c:\Python_3.10.2\lib\site-packages\numpy\core\_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
c:\Python_3.10.2\lib\site-packages\consensusclustering\consensus.py:464: UserWarning: Kneedle algorithm failed to find a knee. Returning maximum number of clusters, however, it is likely that the clustering is unstable. Plot the CDFs and consensus matrices to check.
  warn(
Assigning cluster labels to cells: 0%| | 0/160000 [00:00<?, ?it/s]
AnnData object with n_obs × n_vars = 160000 × 8
obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference', 'cytonorm_cluster', 'fsom_metacluster'
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
obsm: 'normalised'
layers: 'original', 'normalised'
We can access the cluster memberships from the fsom_metacluster column of obs. Below we confirm we do indeed have 15 metaclusters.
len(adata_all.obs['fsom_metacluster'].unique())
15
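As a quick extra check (not part of the original workflow), value_counts() also tells us how many events fall into each cluster. A minimal sketch on made-up labels:

```python
import pandas as pd

# Toy stand-in for adata_all.obs['fsom_metacluster'] (made-up labels)
labels = pd.Series(['cluster_0', 'cluster_1', 'cluster_0', 'cluster_2', 'cluster_0'])
print(labels.nunique())                 # 3 distinct clusters
print(labels.value_counts().to_dict())  # number of events per cluster
```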
Let's subsample the data and calculate the UMAP embedding again, so we can colour the points by metacluster.
adata_sub = sc.pp.subsample(adata_all, n_obs = 50_000, copy = True)
sc.pp.neighbors(adata_sub)
sc.tl.umap(adata_sub, min_dist = 0.1)
sc.pl.umap(adata_sub, color = 'fsom_metacluster', size = 8)
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
Finally, to help us assign cell types to our clusters ("annotating" the clusters), let’s plot the UMAP embeddings, but faceting by antigen, and colouring by expression of that antigen.
sc.pl.umap(
adata_sub,
gene_symbols = 'marker',
color = adata_sub.var_names,
ncols = 3
)
We can also create bivariate plots, colouring by metacluster, as this may help us when annotating the clusters later. Below we plot every parameter against CD45.
for param in adata_sub.var_names:
sc.pl.scatter(adata_sub, x = param, y = 'CD45', color = 'fsom_metacluster', size = 8, )
Annotating our clusters¶
The process of assigning a biological cell type to each cluster is usually the hardest and most labour-intensive part of the analysis. To help with this, in addition to the UMAP plots we created a moment ago, scanpy provides us with a few different visualisation tools. You may not like all of them, but we'll look at them here one by one so you have a feel of what's available.
Let's start with what scanpy calls a dotplot, which is a heatmap where the size of each "dot" is mapped to an estimate of the proportion of events positive for each marker. This is calculated from a single expression cutoff value, which for this example we set to 2000.
sc.pl.dotplot(
adata_all,
var_names = adata_all.var_names,
groupby = 'fsom_metacluster',
dendrogram = True,
expression_cutoff = 2000
)
WARNING: dendrogram data not found (using key=dendrogram_fsom_metacluster). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_dotplot.py:747: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
  dot_ax.scatter(x, y, **kwds)
Another way to visualize metacluster expression is through violin plots. Here we iterate through each parameter and visualize its distribution per metacluster with violin plots. Setting stripplot = False suppresses the drawing of individual events, and speeds up plotting. These plots are also useful for identifying under-clustering, where we may see bimodal or even multimodal parameter distributions.
for param in adata_all.var_names:
sc.pl.violin(
adata_all,
keys = param,
groupby = 'fsom_metacluster',
rotation = 90,
stripplot = False
)
But the visualisation many people are more comfortable with is the humble heatmap, which scanpy calls a matrixplot.
sc.pl.matrixplot(
adata_all,
var_names = adata_all.var_names,
groupby = 'fsom_metacluster',
dendrogram = True
)
The heatmap above displays the mean of the transformed parameter values per metacluster. But it's also useful to first scale the expression values between 0 and 1 within each column (parameter), to visualise which metaclusters express the most and least of each antigen. We do this by providing standard_scale = 'var' (to instead scale within each row, i.e. within each metacluster, set this argument to 'group', but this isn't as useful).
sc.pl.matrixplot(
adata_all,
var_names = adata_all.var_names,
groupby = 'fsom_metacluster',
dendrogram = True,
standard_scale = 'var',
colorbar_title = 'Column scaled\nexpression'
)
The visualisation scanpy calls a heatmap is similar to the one above, except that every individual event is included in the plot. The main useful thing about this is that we can clearly see the proportion of the total that each metacluster represents, and we can use this information to inform cell type annotations.
sc.pl.heatmap(
adata_all,
var_names = adata_all.var_names,
groupby = 'fsom_metacluster',
standard_scale = 'var',
dendrogram = True
)
Armed with the visualisations above, we now need to start assigning cell types to our FlowSOM metaclusters. Note that it's better to over-cluster than to under-cluster, as at this point we can merge clusters we believe represent the same cell type.
To annotate, we create a dictionary of key:value pairs, where the key of each element is the name of the metacluster, and the value of each element is the cell type we wish to assign to that metacluster.
Once we have created our dictionary, we use the map() method on the fsom_metacluster column, which looks at each value of that column, finds the corresponding key in the dictionary we provide as an argument, and pulls out the value. This is then returned as a single series which we assign as a column of the obs subfield.
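As a minimal sketch of how map() performs this lookup (toy labels, not our real clusters):

```python
import pandas as pd

# map() looks up each value of the Series as a key in the dictionary
clusters = pd.Series(['cluster_1', 'cluster_0', 'cluster_1'])
lut = {'cluster_0': 'B cells', 'cluster_1': 'T cells'}
print(clusters.map(lut).tolist())  # ['T cells', 'B cells', 'T cells']
```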
annotations = {
'cluster_0' : 'Other/Unknown',
'cluster_1' : 'Monocytes',
'cluster_2' : 'Other/Unknown',
'cluster_3' : 'Monocytes',
'cluster_4' : 'Other/Unknown',
'cluster_5' : 'Mature neutrophils',
'cluster_6' : 'Immature neutrophils',
'cluster_7' : 'Mature neutrophils',
'cluster_8' : 'Plasma cells',
'cluster_9' : 'B cells',
'cluster_10' : 'T cells',
'cluster_11' : 'T cells',
'cluster_12' : 'T cells',
'cluster_13' : 'B cells',
'cluster_14' : 'B cells',
}
adata_all.obs['population'] = adata_all.obs['fsom_metacluster'].map(annotations)
adata_all.obs
| filename | sample | group | batch_label | batch | reference | cytonorm_cluster | fsom_metacluster | population | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 4 | cluster_14 | B cells |
| 1 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 4 | cluster_9 | B cells |
| 2 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 3 | cluster_1 | Monocytes |
| 3 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 3 | cluster_1 | Monocytes |
| 4 | Mock_01_A.fcs | Mock_01_A | Mock | A | 1 | ref | 4 | cluster_9 | B cells |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 159995 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 3 | cluster_3 | Monocytes |
| 159996 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 2 | cluster_11 | T cells |
| 159997 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 1 | cluster_5 | Mature neutrophils |
| 159998 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 1 | cluster_5 | Mature neutrophils |
| 159999 | Virus_08_B.fcs | WNV_08_B | Virus | B | 2 | other | 3 | cluster_6 | Immature neutrophils |
160000 rows × 9 columns
So we can create a nice UMAP plot labelled with our cell types, let's do the same for the adata_sub AnnData object, then plot the UMAP embedding with the cell type labels.
adata_sub.obs['population'] = adata_sub.obs['fsom_metacluster'].map(annotations)
sc.pl.umap(adata_sub, color = 'population', legend_loc = 'on data')
c:\Python_3.10.2\lib\site-packages\scanpy\plotting\_tools\scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
Lastly, let’s recreate our expression heatmaps using our annotated cell types as rows, instead of the metaclusters.
sc.pl.matrixplot(
adata_all,
var_names = adata_all.var_names,
groupby = 'population',
dendrogram = True,
standard_scale = 'var',
colorbar_title = 'Column scaled\nexpression'
)
WARNING: dendrogram data not found (using key=dendrogram_population). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
sc.pl.heatmap(
adata_all,
var_names = adata_all.var_names,
groupby = 'population',
standard_scale = 'var',
dendrogram = True
)
Exporting the data¶
Now that our original .fcs data are combined and annotated with the metadata and population data, it's a good idea to save a snapshot so we can come back to where we left off in future. The write_csvs() method will write separate .csv files for each subfield of our AnnData object, while the write() method stores the entire object as a .h5ad file, which is AnnData's native file format. We can read an AnnData object back from its .h5ad file using the read_h5ad() function (I do this below just as a demonstration).
adata_all.write_csvs('data/exported/annotated')
adata_all.write('data/exported/annotated.h5ad')
adata_2 = ann.read_h5ad('data/exported/annotated.h5ad')
adata_2
AnnData object with n_obs × n_vars = 160000 × 8
obs: 'filename', 'sample', 'group', 'batch_label', 'batch', 'reference', 'cytonorm_cluster', 'fsom_metacluster', 'population'
var: 'n', 'channel', 'marker', '$PnB', '$PnE', '$PnG', '$PnR', 'signal_type'
uns: 'dendrogram_fsom_metacluster', 'dendrogram_population', 'fsom_metacluster_colors', 'population_colors'
obsm: 'normalised'
layers: 'normalised', 'original'
Statistical analysis¶
So far we have partitioned our dataset into populations of cells. Now what? The next step in your analysis will depend on the questions you are asking, but a general starting point is to generate summary statistics of our data that include:
- the proportion of all events in each population
- the median fluorescence/mass intensity (MFI) of each antigen within each population
Generating summary statistics¶
We can generate the percentage of total made up by each population in each sample using the crosstab() function from pandas. By giving the 'sample' and 'population' columns from obs as the first and second arguments, the function returns a DataFrame of counts (i.e. the number of cells of each type in each sample). By specifying normalize = 'index', we ask the function to convert those counts into proportions that sum to 1 across each row; we then multiply by 100 to turn them into percentages. Finally, we use reset_index() so that 'sample' becomes a regular column (instead of the index), and rename the cell type columns to include the 'percent_' prefix.
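Before applying this to our data, here's a toy sketch of what crosstab() with normalize = 'index' produces (made-up events):

```python
import pandas as pd

# Toy data: three events in sample s1 (two of them B cells), one event in s2
df = pd.DataFrame({'sample': ['s1', 's1', 's1', 's2'],
                   'population': ['B', 'T', 'B', 'T']})
per = pd.crosstab(df['sample'], df['population'], normalize='index') * 100
print(round(per.loc['s1', 'B'], 2))  # 66.67: two of the three s1 events are B
```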
per = pd.crosstab(adata_all.obs['sample'], adata_all.obs['population'], normalize = 'index') * 100
per.reset_index(inplace = True)
per.columns = [('percent_' + col if col != 'sample' else 'sample') for col in per.columns]
per
| sample | percent_B cells | percent_Immature neutrophils | percent_Mature neutrophils | percent_Monocytes | percent_Other/Unknown | percent_Plasma cells | percent_T cells | |
|---|---|---|---|---|---|---|---|---|
| 0 | Mock_01_A | 29.87 | 1.50 | 43.60 | 12.53 | 6.76 | 1.51 | 4.23 |
| 1 | Mock_02_A | 31.57 | 1.82 | 42.13 | 11.51 | 7.65 | 1.34 | 3.98 |
| 2 | Mock_03_A | 31.82 | 1.90 | 40.36 | 13.22 | 7.54 | 1.31 | 3.85 |
| 3 | Mock_04_A | 31.60 | 2.18 | 41.10 | 13.27 | 7.00 | 1.42 | 3.43 |
| 4 | Mock_05_B | 30.42 | 2.12 | 44.06 | 12.31 | 6.87 | 1.06 | 3.16 |
| 5 | Mock_06_B | 31.08 | 1.64 | 43.07 | 12.93 | 6.51 | 1.14 | 3.63 |
| 6 | Mock_07_B | 29.99 | 1.92 | 42.56 | 12.22 | 6.58 | 1.00 | 5.73 |
| 7 | Mock_08_B | 29.85 | 3.04 | 43.87 | 14.88 | 5.97 | 0.35 | 2.04 |
| 8 | WNV_01_A | 16.26 | 4.01 | 35.93 | 24.68 | 7.76 | 3.81 | 7.55 |
| 9 | WNV_02_A | 16.30 | 4.75 | 33.96 | 26.21 | 7.96 | 4.08 | 6.74 |
| 10 | WNV_03_A | 23.64 | 4.71 | 29.93 | 24.62 | 7.09 | 4.22 | 5.79 |
| 11 | WNV_04_A | 17.66 | 4.58 | 39.70 | 19.59 | 6.25 | 4.12 | 8.10 |
| 12 | WNV_05_B | 14.68 | 6.69 | 35.84 | 26.16 | 7.81 | 2.44 | 6.38 |
| 13 | WNV_06_B | 22.34 | 4.61 | 35.54 | 18.67 | 7.53 | 2.99 | 8.32 |
| 14 | WNV_07_B | 18.21 | 5.12 | 36.67 | 20.97 | 7.36 | 3.31 | 8.36 |
| 15 | WNV_08_B | 19.88 | 5.74 | 34.35 | 21.94 | 7.40 | 2.94 | 7.75 |
Creating a DataFrame of median intensities per sample and cell type is going to take a bit more work. We start by creating a DataFrame from the X matrix of our AnnData object, taking the var_names as the column names. We add columns to indicate which sample and population each event belongs to (converting the 'population' column to string to avoid issues later).
df = pd.DataFrame(adata_all.X, columns = adata_all.var_names)
df['sample'] = adata_all.obs['sample'].values
df['population'] = adata_all.obs['population'].values
df['population'] = df['population'].astype(str)
df
| CD3e | CD16-32 | Ly6G | CD45 | CD48 | CD11b | B220 | Ly6C | sample | population | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1131.903513 | 1872.116820 | 694.669156 | 2297.480400 | 2463.423637 | 1297.783650 | 3357.454591 | 1169.627907 | Mock_01_A | B cells |
| 1 | 1137.770474 | 950.059162 | 904.626120 | 1983.218740 | 1594.213946 | 1108.404364 | 3150.453893 | 1655.764651 | Mock_01_A | B cells |
| 2 | 1165.112501 | 2302.928927 | 874.856097 | 2195.772398 | 1992.812869 | 2640.528234 | 1279.057770 | 2509.351436 | Mock_01_A | Monocytes |
| 3 | 1396.375650 | 2399.815210 | 821.163524 | 1567.501324 | 2120.240060 | 3068.668942 | 1869.271736 | 2770.973644 | Mock_01_A | Monocytes |
| 4 | 1029.059341 | 2040.455759 | 475.916973 | 1636.949269 | 2727.709972 | 1732.400436 | 2653.900143 | 1262.789876 | Mock_01_A | B cells |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 159995 | 1163.559107 | 2029.841171 | 1002.002421 | 2102.715461 | 2934.872918 | 1585.463913 | 1251.227131 | 2969.636383 | WNV_08_B | Monocytes |
| 159996 | 2205.121457 | 1106.366082 | 393.125111 | 3449.508003 | 2831.568270 | 1406.120611 | 1439.992540 | 1220.899638 | WNV_08_B | T cells |
| 159997 | 986.625108 | 2726.911881 | 3366.489662 | 2176.967886 | 777.658738 | 3172.651040 | 1180.258926 | 2805.032132 | WNV_08_B | Mature neutrophils |
| 159998 | 1052.918856 | 2361.529652 | 2118.016131 | 1828.316497 | 1409.880745 | 3205.967079 | 1333.772320 | 1949.198108 | WNV_08_B | Mature neutrophils |
| 159999 | 1464.842892 | 2128.241108 | 1280.975755 | 1963.775613 | 2261.822362 | 3533.230152 | 1579.737874 | 3386.055862 | WNV_08_B | Immature neutrophils |
160000 rows × 10 columns
Now we group by each combination of 'sample' and 'population' and calculate the median of each column, using reset_index() to bring those grouping variables back in as columns (rather than in the index).
summary_df = df.groupby(['sample', 'population']).median().reset_index()
summary_df
| sample | population | CD3e | CD16-32 | Ly6G | CD45 | CD48 | CD11b | B220 | Ly6C | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mock_01_A | B cells | 1101.761309 | 1423.115437 | 935.259989 | 1749.991288 | 2373.619364 | 1277.806177 | 3090.182943 | 1206.798099 |
| 1 | Mock_01_A | Immature neutrophils | 1165.109258 | 2630.754903 | 1411.483259 | 1820.910273 | 2056.989269 | 3038.914234 | 1307.665802 | 2976.948647 |
| 2 | Mock_01_A | Mature neutrophils | 1119.270504 | 2642.445614 | 2948.457340 | 1941.780051 | 1299.029195 | 3192.043609 | 1215.742070 | 2404.884796 |
| 3 | Mock_01_A | Monocytes | 1162.399840 | 2753.198221 | 751.430865 | 2021.853018 | 2791.237881 | 2847.182505 | 1265.946475 | 3394.221042 |
| 4 | Mock_01_A | Other/Unknown | 1112.818261 | 1842.080300 | 953.763330 | 1782.565249 | 2322.828900 | 1763.119510 | 1259.278593 | 1339.180622 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 107 | WNV_08_B | Mature neutrophils | 1099.134412 | 2665.096695 | 2591.212414 | 2020.342076 | 1531.437595 | 3189.083968 | 1255.425786 | 2238.279776 |
| 108 | WNV_08_B | Monocytes | 1174.162516 | 2807.642119 | 738.106449 | 2446.502994 | 2864.042728 | 2842.670958 | 1334.504967 | 3286.607718 |
| 109 | WNV_08_B | Other/Unknown | 1136.439127 | 1733.113755 | 929.854376 | 2038.594674 | 2377.648570 | 1812.256466 | 1321.196667 | 1508.252040 |
| 110 | WNV_08_B | Plasma cells | 1361.025777 | 1241.655017 | 599.774053 | 2650.891106 | 2719.473533 | 1500.753960 | 3367.519860 | 3303.251160 |
| 111 | WNV_08_B | T cells | 2478.055265 | 1287.741989 | 754.449005 | 2923.971259 | 2540.479341 | 1532.117473 | 1276.494220 | 2808.743431 |
112 rows × 10 columns
What we're heading towards is a DataFrame where we have one row per sample, and a separate column for each summary statistic (e.g. median_B220_B cells). The easiest way to do this is to first melt() the data into a longer format so that we can then create a new column that combines the marker name and population name. When melting the DataFrame below, we indicate the id_vars that uniquely identify each row (and aren't changed during melting), then give the name of the column that will hold the original column names (var_name) and the name of the column that will hold the original column values (value_name).
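A toy melt() on made-up values shows the reshaping:

```python
import pandas as pd

# Toy wide table: one (sample, population) row with two marker columns
wide = pd.DataFrame({'sample': ['s1'], 'population': ['B cells'],
                     'CD3e': [1100.0], 'CD45': [1750.0]})
# id_vars are carried along unchanged; the marker columns become rows
long = wide.melt(id_vars=['sample', 'population'],
                 var_name='marker', value_name='median')
print(long.shape)  # (2, 4): one row per marker
```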
melted_df = summary_df.melt(
id_vars = ['sample', 'population'],
var_name = 'marker',
value_name = 'median'
)
melted_df
| sample | population | marker | median | |
|---|---|---|---|---|
| 0 | Mock_01_A | B cells | CD3e | 1101.761309 |
| 1 | Mock_01_A | Immature neutrophils | CD3e | 1165.109258 |
| 2 | Mock_01_A | Mature neutrophils | CD3e | 1119.270504 |
| 3 | Mock_01_A | Monocytes | CD3e | 1162.399840 |
| 4 | Mock_01_A | Other/Unknown | CD3e | 1112.818261 |
| ... | ... | ... | ... | ... |
| 891 | WNV_08_B | Mature neutrophils | Ly6C | 2238.279776 |
| 892 | WNV_08_B | Monocytes | Ly6C | 3286.607718 |
| 893 | WNV_08_B | Other/Unknown | Ly6C | 1508.252040 |
| 894 | WNV_08_B | Plasma cells | Ly6C | 3303.251160 |
| 895 | WNV_08_B | T cells | Ly6C | 2808.743431 |
896 rows × 4 columns
Then we just overwrite the 'population' column to contain the marker and population names pasted together.
melted_df['population'] = 'median_' + melted_df['marker'] + '_' + melted_df['population']
melted_df
| sample | population | marker | median | |
|---|---|---|---|---|
| 0 | Mock_01_A | median_CD3e_B cells | CD3e | 1101.761309 |
| 1 | Mock_01_A | median_CD3e_Immature neutrophils | CD3e | 1165.109258 |
| 2 | Mock_01_A | median_CD3e_Mature neutrophils | CD3e | 1119.270504 |
| 3 | Mock_01_A | median_CD3e_Monocytes | CD3e | 1162.399840 |
| 4 | Mock_01_A | median_CD3e_Other/Unknown | CD3e | 1112.818261 |
| ... | ... | ... | ... | ... |
| 891 | WNV_08_B | median_Ly6C_Mature neutrophils | Ly6C | 2238.279776 |
| 892 | WNV_08_B | median_Ly6C_Monocytes | Ly6C | 3286.607718 |
| 893 | WNV_08_B | median_Ly6C_Other/Unknown | Ly6C | 1508.252040 |
| 894 | WNV_08_B | median_Ly6C_Plasma cells | Ly6C | 3303.251160 |
| 895 | WNV_08_B | median_Ly6C_T cells | Ly6C | 2808.743431 |
896 rows × 4 columns
Now that we have this, we can do the opposite of melt() and pivot() the DataFrame, telling it the index that uniquely defines each row, the column that will become the new columns, and the column that contains their values. After resetting the index, we display it so you can see what we've done.
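Again as a toy sketch (made-up values), pivot() spreads the long rows back out into columns:

```python
import pandas as pd

# Toy long table: two statistics for one sample
long = pd.DataFrame({
    'sample': ['s1', 's1'],
    'population': ['median_CD3e_B cells', 'median_CD45_B cells'],
    'median': [1100.0, 1750.0]})
# Each unique 'population' value becomes its own column
wide = long.pivot(index='sample', columns='population', values='median')
print(wide.shape)  # (1, 2): one row per sample, one column per statistic
```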
med = melted_df.pivot(
index = 'sample',
columns = 'population',
values = 'median'
)
med.reset_index(inplace = True)
med
| population | sample | median_B220_B cells | median_B220_Immature neutrophils | median_B220_Mature neutrophils | median_B220_Monocytes | median_B220_Other/Unknown | median_B220_Plasma cells | median_B220_T cells | median_CD11b_B cells | median_CD11b_Immature neutrophils | ... | median_Ly6C_Other/Unknown | median_Ly6C_Plasma cells | median_Ly6C_T cells | median_Ly6G_B cells | median_Ly6G_Immature neutrophils | median_Ly6G_Mature neutrophils | median_Ly6G_Monocytes | median_Ly6G_Other/Unknown | median_Ly6G_Plasma cells | median_Ly6G_T cells |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mock_01_A | 3090.182943 | 1307.665802 | 1215.742070 | 1265.946475 | 1259.278593 | 3301.787883 | 1099.843622 | 1277.806177 | 3038.914234 | ... | 1339.180622 | 3104.558784 | 2135.911893 | 935.259989 | 1411.483259 | 2948.457340 | 751.430865 | 953.763330 | 767.930608 | 579.926444 |
| 1 | Mock_02_A | 3085.970961 | 1272.196495 | 1208.323320 | 1238.359503 | 1240.840479 | 3249.748679 | 1074.725763 | 1314.362718 | 3063.704116 | ... | 1362.432823 | 3037.105580 | 1499.337551 | 924.586287 | 1471.019450 | 2886.312281 | 755.706007 | 939.438870 | 637.209433 | 579.838249 |
| 2 | Mock_03_A | 3041.428383 | 1281.298668 | 1203.842502 | 1221.915165 | 1234.026814 | 3380.545250 | 1032.733042 | 1217.521631 | 3082.020307 | ... | 1297.972361 | 3005.799229 | 1699.243312 | 920.232321 | 1441.387585 | 3011.716552 | 750.735128 | 925.098875 | 625.057153 | 580.685089 |
| 3 | Mock_04_A | 3034.436092 | 1307.427494 | 1204.489710 | 1244.240153 | 1239.408774 | 3383.755132 | 1060.823943 | 1201.893696 | 3020.714550 | ... | 1322.584431 | 3007.110568 | 1707.953541 | 923.717498 | 1422.946623 | 3051.354286 | 750.476799 | 950.173689 | 573.280932 | 593.830408 |
| 4 | Mock_05_B | 3086.026846 | 1318.373726 | 1215.679054 | 1252.564717 | 1202.721633 | 3339.117042 | 1167.188129 | 1277.911177 | 3131.604786 | ... | 1318.134776 | 3191.162596 | 1560.061943 | 934.937702 | 1479.374802 | 2950.421613 | 717.472601 | 911.502331 | 676.084399 | 807.338025 |
| 5 | Mock_06_B | 3188.706530 | 1256.862809 | 1210.078981 | 1253.486391 | 1248.673720 | 3340.692055 | 1170.962175 | 1236.810068 | 3108.493912 | ... | 1364.021029 | 3154.266730 | 1498.393482 | 970.337384 | 1565.147233 | 3005.693770 | 791.220549 | 986.831826 | 634.909694 | 913.175047 |
| 6 | Mock_07_B | 3317.330232 | 1290.941890 | 1240.125675 | 1256.407507 | 1248.860084 | 3309.160681 | 1164.291024 | 1213.658388 | 3116.276716 | ... | 1350.879769 | 3105.305477 | 1572.118538 | 1033.288541 | 1551.196799 | 3038.078472 | 808.987159 | 1006.440278 | 736.455393 | 933.515144 |
| 7 | Mock_08_B | 3049.365297 | 1384.366367 | 1212.416547 | 1218.975442 | 1217.381433 | 3025.927264 | 1097.460705 | 1261.309565 | 3142.063365 | ... | 1359.002096 | 3129.048865 | 2609.474726 | 973.210710 | 1408.020458 | 3041.685610 | 761.469904 | 1004.248264 | 792.685201 | 781.961909 |
| 8 | WNV_01_A | 3428.948815 | 1370.059547 | 1211.098785 | 1306.238835 | 1293.896179 | 3346.580922 | 1076.180702 | 1283.446672 | 3182.772890 | ... | 1515.006429 | 3111.122996 | 2704.455244 | 849.190544 | 1366.928840 | 2826.922345 | 692.327427 | 930.563518 | 668.818900 | 507.506607 |
| 9 | WNV_02_A | 3358.582495 | 1325.453077 | 1210.412213 | 1297.398187 | 1286.650627 | 3331.798330 | 1080.870984 | 1278.262997 | 3100.696801 | ... | 1542.863123 | 3137.276400 | 2835.085762 | 859.054324 | 1349.238459 | 2751.546040 | 711.855131 | 913.830234 | 672.286197 | 476.394896 |
| 10 | WNV_03_A | 3232.116295 | 1344.004772 | 1236.760987 | 1337.517765 | 1323.466632 | 3324.163809 | 1126.636678 | 1262.099822 | 3041.301944 | ... | 1549.755835 | 3150.929616 | 2937.027799 | 891.618697 | 1363.768406 | 2584.591178 | 713.370671 | 911.547972 | 675.465829 | 388.150655 |
| 11 | WNV_04_A | 3113.307588 | 1321.082754 | 1226.724491 | 1283.410313 | 1311.792947 | 3309.123193 | 1094.183150 | 1232.483117 | 3138.744220 | ... | 1510.893847 | 3188.097947 | 2833.182465 | 861.382952 | 1433.205478 | 2858.445502 | 760.760886 | 913.774101 | 671.511543 | 540.268419 |
| 12 | WNV_05_B | 3507.954731 | 1352.112915 | 1243.638338 | 1333.292239 | 1324.828792 | 3344.040171 | 1288.899325 | 1175.217679 | 3294.425312 | ... | 1519.219566 | 3234.496025 | 2935.956810 | 901.263874 | 1316.214670 | 2627.029764 | 782.437117 | 957.514527 | 737.582369 | 789.656096 |
| 13 | WNV_06_B | 3460.297173 | 1394.089066 | 1235.050305 | 1346.164972 | 1316.592450 | 3432.972806 | 1218.058724 | 1200.619376 | 3053.456469 | ... | 1535.759098 | 3319.703598 | 2554.642605 | 946.795196 | 1360.411652 | 2778.339594 | 786.406039 | 942.652115 | 725.059139 | 843.327820 |
| 14 | WNV_07_B | 3407.560131 | 1420.766418 | 1263.141424 | 1370.854252 | 1343.314896 | 3418.642050 | 1273.326803 | 1231.838057 | 3172.606370 | ... | 1487.360275 | 3352.188993 | 2824.607390 | 935.389096 | 1390.926173 | 2735.123015 | 787.431197 | 939.634776 | 783.109349 | 813.413166 |
| 15 | WNV_08_B | 3532.530264 | 1364.843016 | 1255.425786 | 1334.504967 | 1321.196667 | 3367.519860 | 1276.494220 | 1180.057855 | 3186.802177 | ... | 1508.252040 | 3303.251160 | 2808.743431 | 895.629178 | 1408.825017 | 2591.212414 | 738.106449 | 929.854376 | 599.774053 | 754.449005 |
16 rows × 57 columns
Phew! Nearly there. Now we just merge the percentage statistics together with the median intensity statistics into one DataFrame called stats. By default, merge() joins on the columns the two DataFrames have in common, which here is just 'sample'.
stats = per.merge(med)
stats
| sample | percent_B cells | percent_Immature neutrophils | percent_Mature neutrophils | percent_Monocytes | percent_Other/Unknown | percent_Plasma cells | percent_T cells | median_B220_B cells | median_B220_Immature neutrophils | ... | median_Ly6C_Other/Unknown | median_Ly6C_Plasma cells | median_Ly6C_T cells | median_Ly6G_B cells | median_Ly6G_Immature neutrophils | median_Ly6G_Mature neutrophils | median_Ly6G_Monocytes | median_Ly6G_Other/Unknown | median_Ly6G_Plasma cells | median_Ly6G_T cells | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mock_01_A | 29.87 | 1.50 | 43.60 | 12.53 | 6.76 | 1.51 | 4.23 | 3090.182943 | 1307.665802 | ... | 1339.180622 | 3104.558784 | 2135.911893 | 935.259989 | 1411.483259 | 2948.457340 | 751.430865 | 953.763330 | 767.930608 | 579.926444 |
| 1 | Mock_02_A | 31.57 | 1.82 | 42.13 | 11.51 | 7.65 | 1.34 | 3.98 | 3085.970961 | 1272.196495 | ... | 1362.432823 | 3037.105580 | 1499.337551 | 924.586287 | 1471.019450 | 2886.312281 | 755.706007 | 939.438870 | 637.209433 | 579.838249 |
| 2 | Mock_03_A | 31.82 | 1.90 | 40.36 | 13.22 | 7.54 | 1.31 | 3.85 | 3041.428383 | 1281.298668 | ... | 1297.972361 | 3005.799229 | 1699.243312 | 920.232321 | 1441.387585 | 3011.716552 | 750.735128 | 925.098875 | 625.057153 | 580.685089 |
| 3 | Mock_04_A | 31.60 | 2.18 | 41.10 | 13.27 | 7.00 | 1.42 | 3.43 | 3034.436092 | 1307.427494 | ... | 1322.584431 | 3007.110568 | 1707.953541 | 923.717498 | 1422.946623 | 3051.354286 | 750.476799 | 950.173689 | 573.280932 | 593.830408 |
| 4 | Mock_05_B | 30.42 | 2.12 | 44.06 | 12.31 | 6.87 | 1.06 | 3.16 | 3086.026846 | 1318.373726 | ... | 1318.134776 | 3191.162596 | 1560.061943 | 934.937702 | 1479.374802 | 2950.421613 | 717.472601 | 911.502331 | 676.084399 | 807.338025 |
| 5 | Mock_06_B | 31.08 | 1.64 | 43.07 | 12.93 | 6.51 | 1.14 | 3.63 | 3188.706530 | 1256.862809 | ... | 1364.021029 | 3154.266730 | 1498.393482 | 970.337384 | 1565.147233 | 3005.693770 | 791.220549 | 986.831826 | 634.909694 | 913.175047 |
| 6 | Mock_07_B | 29.99 | 1.92 | 42.56 | 12.22 | 6.58 | 1.00 | 5.73 | 3317.330232 | 1290.941890 | ... | 1350.879769 | 3105.305477 | 1572.118538 | 1033.288541 | 1551.196799 | 3038.078472 | 808.987159 | 1006.440278 | 736.455393 | 933.515144 |
| 7 | Mock_08_B | 29.85 | 3.04 | 43.87 | 14.88 | 5.97 | 0.35 | 2.04 | 3049.365297 | 1384.366367 | ... | 1359.002096 | 3129.048865 | 2609.474726 | 973.210710 | 1408.020458 | 3041.685610 | 761.469904 | 1004.248264 | 792.685201 | 781.961909 |
| 8 | WNV_01_A | 16.26 | 4.01 | 35.93 | 24.68 | 7.76 | 3.81 | 7.55 | 3428.948815 | 1370.059547 | ... | 1515.006429 | 3111.122996 | 2704.455244 | 849.190544 | 1366.928840 | 2826.922345 | 692.327427 | 930.563518 | 668.818900 | 507.506607 |
| 9 | WNV_02_A | 16.30 | 4.75 | 33.96 | 26.21 | 7.96 | 4.08 | 6.74 | 3358.582495 | 1325.453077 | ... | 1542.863123 | 3137.276400 | 2835.085762 | 859.054324 | 1349.238459 | 2751.546040 | 711.855131 | 913.830234 | 672.286197 | 476.394896 |
| 10 | WNV_03_A | 23.64 | 4.71 | 29.93 | 24.62 | 7.09 | 4.22 | 5.79 | 3232.116295 | 1344.004772 | ... | 1549.755835 | 3150.929616 | 2937.027799 | 891.618697 | 1363.768406 | 2584.591178 | 713.370671 | 911.547972 | 675.465829 | 388.150655 |
| 11 | WNV_04_A | 17.66 | 4.58 | 39.70 | 19.59 | 6.25 | 4.12 | 8.10 | 3113.307588 | 1321.082754 | ... | 1510.893847 | 3188.097947 | 2833.182465 | 861.382952 | 1433.205478 | 2858.445502 | 760.760886 | 913.774101 | 671.511543 | 540.268419 |
| 12 | WNV_05_B | 14.68 | 6.69 | 35.84 | 26.16 | 7.81 | 2.44 | 6.38 | 3507.954731 | 1352.112915 | ... | 1519.219566 | 3234.496025 | 2935.956810 | 901.263874 | 1316.214670 | 2627.029764 | 782.437117 | 957.514527 | 737.582369 | 789.656096 |
| 13 | WNV_06_B | 22.34 | 4.61 | 35.54 | 18.67 | 7.53 | 2.99 | 8.32 | 3460.297173 | 1394.089066 | ... | 1535.759098 | 3319.703598 | 2554.642605 | 946.795196 | 1360.411652 | 2778.339594 | 786.406039 | 942.652115 | 725.059139 | 843.327820 |
| 14 | WNV_07_B | 18.21 | 5.12 | 36.67 | 20.97 | 7.36 | 3.31 | 8.36 | 3407.560131 | 1420.766418 | ... | 1487.360275 | 3352.188993 | 2824.607390 | 935.389096 | 1390.926173 | 2735.123015 | 787.431197 | 939.634776 | 783.109349 | 813.413166 |
| 15 | WNV_08_B | 19.88 | 5.74 | 34.35 | 21.94 | 7.40 | 2.94 | 7.75 | 3532.530264 | 1364.843016 | ... | 1508.252040 | 3303.251160 | 2808.743431 | 895.629178 | 1408.825017 | 2591.212414 | 738.106449 | 929.854376 | 599.774053 | 754.449005 |
16 rows × 64 columns
Plotting heatmap of summary statistics¶
One option to visualise the summary statistics we just generated is to use a heatmap. With a large number of antigens and/or populations, this heatmap can grow large and hard to read, but we’ll generate one anyway and see how we can filter it down to more interesting variables.
It can be useful to use a diverging colour palette to highlight differences in the data. There are built-in palettes, but we illustrate how to create your own below.
my_pal = sns.diverging_palette(
h_neg = 250,
h_pos = 10,
s = 99,
l = 50,
sep = 1,
n = 50,
center = "dark"
)
To get a feel for what these arguments control, you can experiment with the choose_diverging_palette() function.
sns.choose_diverging_palette()
interactive(children=(IntSlider(value=220, description='h_neg', max=359), IntSlider(value=10, description='h_p…
[(0.2519971417644415, 0.4987337088076726, 0.5751602783606602), (0.43156001218774975, 0.6160490836499025, 0.6735874169971766), (0.611122882611058, 0.7333644584921324, 0.7720145556336929), (0.7906857530343663, 0.8506798333343624, 0.8704416942702093), (0.95, 0.95, 0.95), (0.9282549678814984, 0.7863704363662967, 0.7963965173228867), (0.9022582584936525, 0.6005186021022944, 0.622400049291663), (0.8762615491058064, 0.4146667678382919, 0.44840358126043944), (0.8510408608937171, 0.23436274952246883, 0.2796010376480583)]
It's useful to add columns of row annotations to the heatmap, which we can do by generating a DataFrame with colours we want mapped to each row.
lut1 = {'Mock': '#E41A1C', 'Virus': '#377EB8'}
row_colors1 = metadata['group'].map(lut1)
lut2 = {1: '#4DAF4A', 2: '#984EA3'}
row_colors2 = metadata['batch'].map(lut2)
row_colors = pd.concat([row_colors1, row_colors2], axis = 1)
row_colors.index = metadata['sample']
row_colors
| group | batch | |
|---|---|---|
| sample | ||
| Mock_01_A | #E41A1C | #4DAF4A |
| Mock_02_A | #E41A1C | #4DAF4A |
| Mock_03_A | #E41A1C | #4DAF4A |
| Mock_04_A | #E41A1C | #4DAF4A |
| WNV_01_A | #377EB8 | #4DAF4A |
| WNV_02_A | #377EB8 | #4DAF4A |
| WNV_03_A | #377EB8 | #4DAF4A |
| WNV_04_A | #377EB8 | #4DAF4A |
| Mock_05_B | #E41A1C | #984EA3 |
| Mock_06_B | #E41A1C | #984EA3 |
| Mock_07_B | #E41A1C | #984EA3 |
| Mock_08_B | #E41A1C | #984EA3 |
| WNV_05_B | #377EB8 | #984EA3 |
| WNV_06_B | #377EB8 | #984EA3 |
| WNV_07_B | #377EB8 | #984EA3 |
| WNV_08_B | #377EB8 | #984EA3 |
Now we turn our 'sample' column into the index.
stats = stats.set_index('sample')
And plot the heatmap, using z_score = 1 to indicate we wish the data to be z-scored within each column (variable). Notice that our row annotations show the two groups of samples cluster separately, but that there is still a batch effect present in the data.
sns.clustermap(
stats,
z_score = 1,
cmap = my_pal,
row_colors = row_colors,
figsize = (16, 8),
cbar_pos = (0.08, 0.12, 0.03, 0.2),
cbar_kws = {'label' : 'Scaled expression'}
)
<seaborn.matrix.ClusterGrid at 0x16b1d6b2800>
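With this many variables the clustermap can get crowded. One option (a sketch on a hypothetical miniature version of the stats DataFrame, not part of the original workflow) is to subset the columns by name before plotting, e.g. keeping only the percentages:

```python
import pandas as pd

# Hypothetical miniature stand-in for the stats DataFrame built above
stats = pd.DataFrame(
    {'percent_B cells': [29.9, 16.3],
     'median_CD3e_T cells': [2135.9, 2704.5]},
    index=['Mock_01_A', 'WNV_01_A'])
# Keep only the percentage columns via a column-name filter
per_only = stats.loc[:, stats.columns.str.startswith('percent_')]
print(list(per_only.columns))  # ['percent_B cells']
```

The same filtered DataFrame can then be passed to sns.clustermap() in place of stats.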
Principal component analysis of samples¶
We can use PCA to see if our groups of samples partition separately from each other when projected onto the first two principal components. Because we have variables on different scales (percentages and MFIs), it's important we first compute z-scores for each of our variables.
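As a quick toy check of what scipy.stats.zscore does: each value becomes (x − mean) / sd, so the result has mean 0 and standard deviation 1.

```python
import numpy as np
from scipy import stats as sps  # aliased to avoid clashing with our stats DataFrame

x = np.array([1.0, 2.0, 3.0])
z = sps.zscore(x)  # population standard deviation (ddof=0) by default
print(np.round(z, 3))  # [-1.225  0.     1.225]
```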
stats_zscore = stats.apply(scipy.stats.zscore)
stats_zscore
| percent_B cells | percent_Immature neutrophils | percent_Mature neutrophils | percent_Monocytes | percent_Other/Unknown | percent_Plasma cells | percent_T cells | median_B220_B cells | median_B220_Immature neutrophils | median_B220_Mature neutrophils | ... | median_Ly6C_Other/Unknown | median_Ly6C_Plasma cells | median_Ly6C_T cells | median_Ly6G_B cells | median_Ly6G_Immature neutrophils | median_Ly6G_Mature neutrophils | median_Ly6G_Monocytes | median_Ly6G_Other/Unknown | median_Ly6G_Plasma cells | median_Ly6G_T cells | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample | |||||||||||||||||||||
| Mock_01_A | 0.802854 | -1.236989 | 1.115000 | -0.985028 | -0.638389 | -0.630948 | -0.655261 | -0.884702 | -0.538677 | -0.484373 | ... | -0.975356 | -0.532374 | -0.275018 | 0.284556 | -0.146172 | 0.588752 | -0.072959 | 0.298825 | 1.294187 | -0.618597 |
| Mock_02_A | 1.066753 | -1.041091 | 0.765010 | -1.173643 | 0.907641 | -0.764192 | -0.777969 | -0.908629 | -1.324767 | -0.891918 | ... | -0.726207 | -1.201652 | -1.377121 | 0.051218 | 0.744326 | 0.195435 | 0.060836 | -0.178963 | -0.772626 | -0.619141 |
| Mock_03_A | 1.105562 | -0.992117 | 0.343592 | -0.857435 | 0.716559 | -0.787705 | -0.841777 | -1.161661 | -1.123040 | -1.138068 | ... | -1.416904 | -1.512277 | -1.031023 | -0.043964 | 0.301114 | 0.989122 | -0.094733 | -0.657269 | -0.964764 | -0.613917 |
| Mock_04_A | 1.071410 | -0.820706 | 0.519778 | -0.848189 | -0.221482 | -0.701489 | -1.047926 | -1.201382 | -0.543959 | -1.102514 | ... | -1.153184 | -1.499266 | -1.015943 | 0.032225 | 0.025288 | 1.239990 | -0.102817 | 0.179094 | -1.783390 | -0.532834 |
| Mock_05_B | 0.888233 | -0.857437 | 1.224521 | -1.025709 | -0.447306 | -0.983652 | -1.180451 | -0.908311 | -0.301362 | -0.487835 | ... | -1.200863 | 0.326919 | -1.271988 | 0.277511 | 0.869299 | 0.601184 | -1.135716 | -1.110778 | -0.157980 | 0.784119 |
| Mock_06_B | 0.990688 | -1.151284 | 0.988813 | -0.911061 | -1.072667 | -0.920949 | -0.949760 | -0.325020 | -1.664601 | -0.795472 | ... | -0.709190 | -0.039166 | -1.378755 | 1.051384 | 2.152220 | 0.951003 | 1.172298 | 1.401815 | -0.808987 | 1.436940 |
| Mock_07_B | 0.821483 | -0.979873 | 0.867388 | -1.042352 | -0.951069 | -1.030679 | 0.080987 | 0.405650 | -0.909322 | 0.855124 | ... | -0.849999 | -0.524965 | -1.251115 | 2.427560 | 1.943560 | 1.155967 | 1.728321 | 2.055849 | 0.796537 | 1.562402 |
| Mock_08_B | 0.799750 | -0.294230 | 1.179284 | -0.550474 | -2.010707 | -1.540141 | -1.730182 | -1.116574 | 1.161202 | -0.667059 | ... | -0.762968 | -0.289380 | 0.544862 | 1.114198 | -0.197966 | 1.178796 | 0.241223 | 1.982735 | 1.685578 | 0.627594 |
| WNV_01_A | -1.309890 | 0.299586 | -0.711142 | 1.261709 | 1.098723 | 1.171761 | 0.974301 | 1.039720 | 0.844127 | -0.739449 | ... | 0.908625 | -0.467243 | 0.709302 | -1.597010 | -0.812585 | -0.180445 | -1.922658 | -0.474998 | -0.272854 | -1.065295 |
| WNV_02_A | -1.303680 | 0.752601 | -1.180177 | 1.544631 | 1.446146 | 1.383383 | 0.576727 | 0.639991 | -0.144466 | -0.777166 | ... | 1.207111 | -0.207746 | 0.935463 | -1.381378 | -1.077184 | -0.657504 | -1.311520 | -1.033131 | -0.218033 | -1.257198 |
| WNV_03_A | -0.164258 | 0.728113 | -2.139675 | 1.250614 | -0.065142 | 1.493113 | 0.110437 | -0.078423 | 0.266687 | 0.670287 | ... | 1.280967 | -0.072277 | 1.111956 | -0.669487 | -0.859856 | -1.714165 | -1.264090 | -1.109256 | -0.167760 | -1.801504 |
| WNV_04_A | -1.092561 | 0.648530 | 0.186453 | 0.320484 | -1.524316 | 1.414735 | 1.244259 | -0.753338 | -0.241323 | 0.118939 | ... | 0.864559 | 0.296512 | 0.932168 | -1.330471 | 0.178733 | 0.019065 | 0.219033 | -1.035004 | -0.230281 | -0.863214 |
| WNV_05_B | -1.555160 | 1.940233 | -0.732570 | 1.535386 | 1.185579 | 0.097973 | 0.400028 | 1.488527 | 0.446384 | 1.048090 | ... | 0.953769 | 0.756879 | 1.110101 | -0.458634 | -1.571130 | -1.445570 | 0.897412 | 0.423945 | 0.814356 | 0.675053 |
| WNV_06_B | -0.366063 | 0.666895 | -0.803997 | 0.150360 | 0.699188 | 0.529056 | 1.352242 | 1.217800 | 1.376682 | 0.576312 | ... | 1.130991 | 1.602318 | 0.449931 | 0.536728 | -0.910064 | -0.487927 | 1.021623 | -0.071786 | 0.616353 | 1.006110 |
| WNV_07_B | -1.007182 | 0.979108 | -0.534956 | 0.575668 | 0.403879 | 0.779868 | 1.371875 | 0.918218 | 1.967920 | 2.119479 | ... | 0.612396 | 1.924642 | 0.917322 | 0.287379 | -0.453650 | -0.761445 | 1.053706 | -0.172429 | 1.534176 | 0.821591 |
| WNV_08_B | -0.747940 | 1.358660 | -1.087322 | 0.755038 | 0.473363 | 0.489867 | 1.072468 | 1.628133 | 0.728515 | 1.695625 | ... | 0.836252 | 1.439075 | 0.889857 | -0.581814 | -0.185932 | -1.672259 | -0.489959 | -0.498651 | -1.364512 | 0.457890 |
16 rows × 63 columns
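As a sanity check, every z-scored column should have mean ≈ 0 and (population) standard deviation ≈ 1 — `scipy.stats.zscore` uses `ddof=0` by default, so compare against `std(ddof=0)`. A quick sketch on toy data in place of `stats`:

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)
# Toy stand-in for the stats DataFrame: 16 samples x 3 variables
df = pd.DataFrame(rng.normal(loc=5, scale=2, size=(16, 3)), columns=['a', 'b', 'c'])
df_z = df.apply(scipy.stats.zscore)

# Each column now has mean ~0 and population standard deviation ~1
print(df_z.mean().round(6).tolist())
print(df_z.std(ddof=0).round(6).tolist())
```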
We specify the number of components we want (just 2, for plotting) when constructing the PCA object, then compute the projections with its fit_transform() method. This gives us an array with one row per sample and one column per principal component.
pca = PCA(n_components = 2)
stats_pca = pca.fit_transform(stats_zscore)
stats_pca
array([[-4.93095277, -0.57487011],
[-6.37596216, -1.27243114],
[-5.32943365, -3.3606912 ],
[-4.86278672, -3.58585446],
[-4.83576239, 1.35172841],
[-5.2420011 , 3.17651377],
[-4.39018757, 5.36285259],
[-3.9788662 , 2.25648044],
[ 3.62261647, -3.90003197],
[ 4.47680926, -4.23085216],
[ 4.37420167, -4.21502922],
[ 2.91086004, -3.21233533],
[ 7.70780641, 1.72354253],
[ 4.18458799, 4.41752128],
[ 6.39532467, 3.79089147],
[ 6.27374605, 2.27256511]])
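It's worth checking how much of the total variance those two components actually capture; scikit-learn's fitted PCA object exposes this as `pca.explained_variance_ratio_`. The sketch below computes the same quantity directly from the singular values of the centred matrix, using toy data in place of `stats_zscore`:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 10))   # stand-in for the z-scored stats matrix
Xc = X - X.mean(axis=0)         # PCA centres each variable first

# Squared singular values are proportional to the component variances
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained_variance_ratio = (s ** 2) / (s ** 2).sum()

# The first two entries correspond to pca.explained_variance_ratio_
print(explained_variance_ratio[:2])
```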
We can add these as columns to our stats DataFrame.
stats['PCA1'] = stats_pca[:, 0]
stats['PCA2'] = stats_pca[:, 1]
stats
| percent_B cells | percent_Immature neutrophils | percent_Mature neutrophils | percent_Monocytes | percent_Other/Unknown | percent_Plasma cells | percent_T cells | median_B220_B cells | median_B220_Immature neutrophils | median_B220_Mature neutrophils | ... | median_Ly6C_T cells | median_Ly6G_B cells | median_Ly6G_Immature neutrophils | median_Ly6G_Mature neutrophils | median_Ly6G_Monocytes | median_Ly6G_Other/Unknown | median_Ly6G_Plasma cells | median_Ly6G_T cells | PCA1 | PCA2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample | |||||||||||||||||||||
| Mock_01_A | 29.87 | 1.50 | 43.60 | 12.53 | 6.76 | 1.51 | 4.23 | 3090.182943 | 1307.665802 | 1215.742070 | ... | 2135.911893 | 935.259989 | 1411.483259 | 2948.457340 | 751.430865 | 953.763330 | 767.930608 | 579.926444 | -4.930953 | -0.574870 |
| Mock_02_A | 31.57 | 1.82 | 42.13 | 11.51 | 7.65 | 1.34 | 3.98 | 3085.970961 | 1272.196495 | 1208.323320 | ... | 1499.337551 | 924.586287 | 1471.019450 | 2886.312281 | 755.706007 | 939.438870 | 637.209433 | 579.838249 | -6.375962 | -1.272431 |
| Mock_03_A | 31.82 | 1.90 | 40.36 | 13.22 | 7.54 | 1.31 | 3.85 | 3041.428383 | 1281.298668 | 1203.842502 | ... | 1699.243312 | 920.232321 | 1441.387585 | 3011.716552 | 750.735128 | 925.098875 | 625.057153 | 580.685089 | -5.329434 | -3.360691 |
| Mock_04_A | 31.60 | 2.18 | 41.10 | 13.27 | 7.00 | 1.42 | 3.43 | 3034.436092 | 1307.427494 | 1204.489710 | ... | 1707.953541 | 923.717498 | 1422.946623 | 3051.354286 | 750.476799 | 950.173689 | 573.280932 | 593.830408 | -4.862787 | -3.585854 |
| Mock_05_B | 30.42 | 2.12 | 44.06 | 12.31 | 6.87 | 1.06 | 3.16 | 3086.026846 | 1318.373726 | 1215.679054 | ... | 1560.061943 | 934.937702 | 1479.374802 | 2950.421613 | 717.472601 | 911.502331 | 676.084399 | 807.338025 | -4.835762 | 1.351728 |
| Mock_06_B | 31.08 | 1.64 | 43.07 | 12.93 | 6.51 | 1.14 | 3.63 | 3188.706530 | 1256.862809 | 1210.078981 | ... | 1498.393482 | 970.337384 | 1565.147233 | 3005.693770 | 791.220549 | 986.831826 | 634.909694 | 913.175047 | -5.242001 | 3.176514 |
| Mock_07_B | 29.99 | 1.92 | 42.56 | 12.22 | 6.58 | 1.00 | 5.73 | 3317.330232 | 1290.941890 | 1240.125675 | ... | 1572.118538 | 1033.288541 | 1551.196799 | 3038.078472 | 808.987159 | 1006.440278 | 736.455393 | 933.515144 | -4.390188 | 5.362853 |
| Mock_08_B | 29.85 | 3.04 | 43.87 | 14.88 | 5.97 | 0.35 | 2.04 | 3049.365297 | 1384.366367 | 1212.416547 | ... | 2609.474726 | 973.210710 | 1408.020458 | 3041.685610 | 761.469904 | 1004.248264 | 792.685201 | 781.961909 | -3.978866 | 2.256480 |
| WNV_01_A | 16.26 | 4.01 | 35.93 | 24.68 | 7.76 | 3.81 | 7.55 | 3428.948815 | 1370.059547 | 1211.098785 | ... | 2704.455244 | 849.190544 | 1366.928840 | 2826.922345 | 692.327427 | 930.563518 | 668.818900 | 507.506607 | 3.622616 | -3.900032 |
| WNV_02_A | 16.30 | 4.75 | 33.96 | 26.21 | 7.96 | 4.08 | 6.74 | 3358.582495 | 1325.453077 | 1210.412213 | ... | 2835.085762 | 859.054324 | 1349.238459 | 2751.546040 | 711.855131 | 913.830234 | 672.286197 | 476.394896 | 4.476809 | -4.230852 |
| WNV_03_A | 23.64 | 4.71 | 29.93 | 24.62 | 7.09 | 4.22 | 5.79 | 3232.116295 | 1344.004772 | 1236.760987 | ... | 2937.027799 | 891.618697 | 1363.768406 | 2584.591178 | 713.370671 | 911.547972 | 675.465829 | 388.150655 | 4.374202 | -4.215029 |
| WNV_04_A | 17.66 | 4.58 | 39.70 | 19.59 | 6.25 | 4.12 | 8.10 | 3113.307588 | 1321.082754 | 1226.724491 | ... | 2833.182465 | 861.382952 | 1433.205478 | 2858.445502 | 760.760886 | 913.774101 | 671.511543 | 540.268419 | 2.910860 | -3.212335 |
| WNV_05_B | 14.68 | 6.69 | 35.84 | 26.16 | 7.81 | 2.44 | 6.38 | 3507.954731 | 1352.112915 | 1243.638338 | ... | 2935.956810 | 901.263874 | 1316.214670 | 2627.029764 | 782.437117 | 957.514527 | 737.582369 | 789.656096 | 7.707806 | 1.723543 |
| WNV_06_B | 22.34 | 4.61 | 35.54 | 18.67 | 7.53 | 2.99 | 8.32 | 3460.297173 | 1394.089066 | 1235.050305 | ... | 2554.642605 | 946.795196 | 1360.411652 | 2778.339594 | 786.406039 | 942.652115 | 725.059139 | 843.327820 | 4.184588 | 4.417521 |
| WNV_07_B | 18.21 | 5.12 | 36.67 | 20.97 | 7.36 | 3.31 | 8.36 | 3407.560131 | 1420.766418 | 1263.141424 | ... | 2824.607390 | 935.389096 | 1390.926173 | 2735.123015 | 787.431197 | 939.634776 | 783.109349 | 813.413166 | 6.395325 | 3.790891 |
| WNV_08_B | 19.88 | 5.74 | 34.35 | 21.94 | 7.40 | 2.94 | 7.75 | 3532.530264 | 1364.843016 | 1255.425786 | ... | 2808.743431 | 895.629178 | 1408.825017 | 2591.212414 | 738.106449 | 929.854376 | 599.774053 | 754.449005 | 6.273746 | 2.272565 |
16 rows × 65 columns
So that we can plot the principal component projections, let's add the sample metadata to stats by joining on a shared index built from the sample column.
metadata.set_index('sample', inplace = True)
stats = stats.join(metadata)
stats
| percent_B cells | percent_Immature neutrophils | percent_Mature neutrophils | percent_Monocytes | percent_Other/Unknown | percent_Plasma cells | percent_T cells | median_B220_B cells | median_B220_Immature neutrophils | median_B220_Mature neutrophils | ... | median_Ly6G_Other/Unknown | median_Ly6G_Plasma cells | median_Ly6G_T cells | PCA1 | PCA2 | filename | group | batch_label | batch | reference | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample | |||||||||||||||||||||
| Mock_01_A | 29.87 | 1.50 | 43.60 | 12.53 | 6.76 | 1.51 | 4.23 | 3090.182943 | 1307.665802 | 1215.742070 | ... | 953.763330 | 767.930608 | 579.926444 | -4.930953 | -0.574870 | Mock_01_A.fcs | Mock | A | 1 | ref |
| Mock_02_A | 31.57 | 1.82 | 42.13 | 11.51 | 7.65 | 1.34 | 3.98 | 3085.970961 | 1272.196495 | 1208.323320 | ... | 939.438870 | 637.209433 | 579.838249 | -6.375962 | -1.272431 | Mock_02_A.fcs | Mock | A | 1 | other |
| Mock_03_A | 31.82 | 1.90 | 40.36 | 13.22 | 7.54 | 1.31 | 3.85 | 3041.428383 | 1281.298668 | 1203.842502 | ... | 925.098875 | 625.057153 | 580.685089 | -5.329434 | -3.360691 | Mock_03_A.fcs | Mock | A | 1 | other |
| Mock_04_A | 31.60 | 2.18 | 41.10 | 13.27 | 7.00 | 1.42 | 3.43 | 3034.436092 | 1307.427494 | 1204.489710 | ... | 950.173689 | 573.280932 | 593.830408 | -4.862787 | -3.585854 | Mock_04_A.fcs | Mock | A | 1 | other |
| Mock_05_B | 30.42 | 2.12 | 44.06 | 12.31 | 6.87 | 1.06 | 3.16 | 3086.026846 | 1318.373726 | 1215.679054 | ... | 911.502331 | 676.084399 | 807.338025 | -4.835762 | 1.351728 | Mock_05_B.fcs | Mock | B | 2 | ref |
| Mock_06_B | 31.08 | 1.64 | 43.07 | 12.93 | 6.51 | 1.14 | 3.63 | 3188.706530 | 1256.862809 | 1210.078981 | ... | 986.831826 | 634.909694 | 913.175047 | -5.242001 | 3.176514 | Mock_06_B.fcs | Mock | B | 2 | other |
| Mock_07_B | 29.99 | 1.92 | 42.56 | 12.22 | 6.58 | 1.00 | 5.73 | 3317.330232 | 1290.941890 | 1240.125675 | ... | 1006.440278 | 736.455393 | 933.515144 | -4.390188 | 5.362853 | Mock_07_B.fcs | Mock | B | 2 | other |
| Mock_08_B | 29.85 | 3.04 | 43.87 | 14.88 | 5.97 | 0.35 | 2.04 | 3049.365297 | 1384.366367 | 1212.416547 | ... | 1004.248264 | 792.685201 | 781.961909 | -3.978866 | 2.256480 | Mock_08_B.fcs | Mock | B | 2 | other |
| WNV_01_A | 16.26 | 4.01 | 35.93 | 24.68 | 7.76 | 3.81 | 7.55 | 3428.948815 | 1370.059547 | 1211.098785 | ... | 930.563518 | 668.818900 | 507.506607 | 3.622616 | -3.900032 | Virus_01_A.fcs | Virus | A | 1 | other |
| WNV_02_A | 16.30 | 4.75 | 33.96 | 26.21 | 7.96 | 4.08 | 6.74 | 3358.582495 | 1325.453077 | 1210.412213 | ... | 913.830234 | 672.286197 | 476.394896 | 4.476809 | -4.230852 | Virus_02_A.fcs | Virus | A | 1 | other |
| WNV_03_A | 23.64 | 4.71 | 29.93 | 24.62 | 7.09 | 4.22 | 5.79 | 3232.116295 | 1344.004772 | 1236.760987 | ... | 911.547972 | 675.465829 | 388.150655 | 4.374202 | -4.215029 | Virus_03_A.fcs | Virus | A | 1 | other |
| WNV_04_A | 17.66 | 4.58 | 39.70 | 19.59 | 6.25 | 4.12 | 8.10 | 3113.307588 | 1321.082754 | 1226.724491 | ... | 913.774101 | 671.511543 | 540.268419 | 2.910860 | -3.212335 | Virus_04_A.fcs | Virus | A | 1 | other |
| WNV_05_B | 14.68 | 6.69 | 35.84 | 26.16 | 7.81 | 2.44 | 6.38 | 3507.954731 | 1352.112915 | 1243.638338 | ... | 957.514527 | 737.582369 | 789.656096 | 7.707806 | 1.723543 | Virus_05_B.fcs | Virus | B | 2 | other |
| WNV_06_B | 22.34 | 4.61 | 35.54 | 18.67 | 7.53 | 2.99 | 8.32 | 3460.297173 | 1394.089066 | 1235.050305 | ... | 942.652115 | 725.059139 | 843.327820 | 4.184588 | 4.417521 | Virus_06_B.fcs | Virus | B | 2 | other |
| WNV_07_B | 18.21 | 5.12 | 36.67 | 20.97 | 7.36 | 3.31 | 8.36 | 3407.560131 | 1420.766418 | 1263.141424 | ... | 939.634776 | 783.109349 | 813.413166 | 6.395325 | 3.790891 | Virus_07_B.fcs | Virus | B | 2 | other |
| WNV_08_B | 19.88 | 5.74 | 34.35 | 21.94 | 7.40 | 2.94 | 7.75 | 3532.530264 | 1364.843016 | 1255.425786 | ... | 929.854376 | 599.774053 | 754.449005 | 6.273746 | 2.272565 | Virus_08_B.fcs | Virus | B | 2 | other |
16 rows × 70 columns
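DataFrame.join performs a left join aligned on the index, so any sample missing from metadata simply gets NaNs rather than raising an error. A toy sketch with hypothetical frames:

```python
import pandas as pd

# Left frame has an extra sample ('s3') with no metadata entry
left = pd.DataFrame({'PCA1': [1.0, 2.0, 3.0]}, index=['s1', 's2', 's3'])
right = pd.DataFrame({'group': ['Mock', 'Virus']}, index=['s1', 's2'])

joined = left.join(right)  # left join on the index: 's3' gets NaN for 'group'
print(joined)
```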
And now we visualise the projections, colouring by group and using different marker shapes for the batches. It seems as though PCA1 largely captures the (substantial) differences between groups, while PCA2 largely captures the remaining batch effect.
sns.scatterplot(
stats,
x = 'PCA1',
y = 'PCA2',
hue = 'group',
style = 'batch'
)
<Axes: xlabel='PCA1', ylabel='PCA2'>
Pairwise comparisons¶
To identify summary statistics that are unlikely to be the same across our groups, we can perform pairwise Mann-Whitney U tests (the non-parametric equivalent of the t-test) and compute fold changes and false discovery rate (FDR)-corrected p values. There are more sophisticated things we could do (especially to control for that batch effect), but this will do as a first pass.
Let's start by splitting the DataFrame into one corresponding to the mock samples, and another corresponding to the virus samples.
mock = stats[stats['group'] == 'Mock']
virus = stats[stats['group'] == 'Virus']
Then we iterate over each statistic, perform a Mann-Whitney U test, calculate the fold change of medians and its log2, and store these alongside the p value in a dictionary. We convert the result into a DataFrame, then add an FDR-corrected p value, its -log10 (for creating a volcano plot), and a column we can colour by. There's quite a lot going on below, but take it one step at a time and you should be able to follow it.
results = {}
for column in stats.columns[:63]:  # only the 63 summary-statistic columns (skip PCA/metadata)
u_statistic, p_value = scipy.stats.mannwhitneyu(mock[column], virus[column])
fc = virus[column].median() / mock[column].median()
log2fc = np.log2(fc)
results[column] = {
'FC' : fc,
'Log2FC' : log2fc,
'p-value' : p_value
}
stats_tab = pd.DataFrame(results).transpose()
stats_tab['FDR_p-value'] = scipy.stats.false_discovery_control(stats_tab['p-value'])
stats_tab['neg_log10_FDR_p-value'] = -np.log10(stats_tab['FDR_p-value'])
stats_tab['Sig'] = ['p > 0.05' if p > 0.05 else 'p < 0.05' for p in stats_tab['FDR_p-value']]
stats_tab
| FC | Log2FC | p-value | FDR_p-value | neg_log10_FDR_p-value | Sig | |
|---|---|---|---|---|---|---|
| percent_B cells | 0.583252 | -0.777809 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Immature neutrophils | 2.476440 | 1.308268 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Mature neutrophils | 0.833586 | -0.262596 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Monocytes | 1.828751 | 0.870859 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Other/Unknown | 1.095378 | 0.131429 | 0.082984 | 0.137578 | 0.861450 | p > 0.05 |
| ... | ... | ... | ... | ... | ... | ... |
| median_Ly6G_Mature neutrophils | 0.911799 | -0.133212 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_Ly6G_Monocytes | 0.994513 | -0.007938 | 0.441803 | 0.545756 | 0.263001 | p > 0.05 |
| median_Ly6G_Other/Unknown | 0.977143 | -0.033359 | 0.160528 | 0.229847 | 0.638560 | p > 0.05 |
| median_Ly6G_Plasma cells | 1.026238 | 0.037365 | 0.798446 | 0.852578 | 0.069266 | p > 0.05 |
| median_Ly6G_T cells | 0.941070 | -0.087625 | 0.278632 | 0.373486 | 0.427726 | p > 0.05 |
63 rows × 6 columns
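scipy.stats.false_discovery_control applies the Benjamini-Hochberg procedure by default: each p value is scaled by m / rank (m tests, ranked ascending) and the results are then made monotone. A minimal numpy sketch of that adjustment:

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (what
    scipy.stats.false_discovery_control computes by default)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty_like(p)
    out[order] = np.clip(adjusted, 0, 1)
    return out

print(bh_adjust([0.001, 0.01, 0.03, 0.2]))  # [0.004, 0.02, 0.04, 0.2]
```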
We can visualise the log2 fold change against the p values using a volcano plot.
p = sns.scatterplot(
stats_tab,
x = 'Log2FC',
y = 'neg_log10_FDR_p-value',
hue = 'Sig'
)
plt.xlim(-1.6, 1.6)
plt.axhline(y = 1.3)
plt.axvline(x = -np.log2(1.03))
plt.axvline(x = np.log2(1.03))
<matplotlib.lines.Line2D at 0x16b25e9f9a0>
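The guide lines are simply our thresholds expressed on the plot's scales: the horizontal line sits at -log10(0.05) ≈ 1.3 (the significance cut-off), and the vertical lines at ±log2(1.03), i.e. a ~3% fold change in either direction:

```python
import numpy as np

# Horizontal line: p = 0.05 on the -log10 scale
print(-np.log10(0.05))  # ≈ 1.301

# Vertical lines: a 1.03x fold change on the log2 scale
print(np.log2(1.03))    # ≈ 0.0426
```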
Filtering comparisons of interest¶
Let's focus in on comparisons with large fold changes and small p values. We'll define as 'of interest' any feature with an FDR-corrected p value below 0.05 and an absolute log2 fold change of at least 0.04 (a fold change of roughly 1.03 in either direction). I've set the bar very low, but you can customise this.
fold_filter = abs(stats_tab['Log2FC']) >= 0.04  # threshold on the log2 scale (np.log2(0.04) would be negative, making the filter vacuous)
p_filter = stats_tab['FDR_p-value'] < 0.05
stats_filtered = stats_tab[fold_filter & p_filter]
stats_filtered
| FC | Log2FC | p-value | FDR_p-value | neg_log10_FDR_p-value | Sig | |
|---|---|---|---|---|---|---|
| percent_B cells | 0.583252 | -0.777809 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Immature neutrophils | 2.476440 | 1.308268 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Mature neutrophils | 0.833586 | -0.262596 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Monocytes | 1.828751 | 0.870859 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_Plasma cells | 2.906122 | 1.539095 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| percent_T cells | 2.045455 | 1.032421 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_B220_B cells | 1.107665 | 0.147522 | 0.001088 | 0.003263 | 2.486329 | p < 0.05 |
| median_B220_Immature neutrophils | 1.045639 | 0.064385 | 0.004662 | 0.011748 | 1.930027 | p < 0.05 |
| median_B220_Monocytes | 1.068484 | 0.095566 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_B220_Other/Unknown | 1.063518 | 0.088844 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD11b_T cells | 1.067611 | 0.094385 | 0.001865 | 0.004895 | 2.310238 | p < 0.05 |
| median_CD3e_B cells | 1.020668 | 0.029514 | 0.000622 | 0.001958 | 2.708178 | p < 0.05 |
| median_CD3e_Mature neutrophils | 0.990726 | -0.013443 | 0.014763 | 0.032071 | 1.493882 | p < 0.05 |
| median_CD3e_Monocytes | 1.029150 | 0.041453 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD3e_Other/Unknown | 1.018341 | 0.026221 | 0.014763 | 0.032071 | 1.493882 | p < 0.05 |
| median_CD45_B cells | 1.367670 | 0.451720 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD45_Immature neutrophils | 1.149965 | 0.201590 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD45_Monocytes | 1.166929 | 0.222717 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD45_Other/Unknown | 1.132105 | 0.179008 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD45_T cells | 1.025594 | 0.036460 | 0.020668 | 0.040691 | 1.390506 | p < 0.05 |
| median_CD48_Immature neutrophils | 1.068819 | 0.096018 | 0.014763 | 0.032071 | 1.493882 | p < 0.05 |
| median_CD48_Mature neutrophils | 1.147134 | 0.198035 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_CD48_Plasma cells | 0.961270 | -0.056986 | 0.001865 | 0.004895 | 2.310238 | p < 0.05 |
| median_Ly6C_B cells | 1.181472 | 0.240586 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_Ly6C_Mature neutrophils | 0.931586 | -0.102240 | 0.000311 | 0.001031 | 2.986932 | p < 0.05 |
| median_Ly6C_Monocytes | 0.967948 | -0.046999 | 0.020668 | 0.040691 | 1.390506 | p < 0.05 |
| median_Ly6C_Other/Unknown | 1.127940 | 0.173690 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
| median_Ly6C_Plasma cells | 1.034257 | 0.048594 | 0.010412 | 0.025229 | 1.598107 | p < 0.05 |
| median_Ly6C_T cells | 1.729491 | 0.790347 | 0.000311 | 0.001031 | 2.986932 | p < 0.05 |
| median_Ly6G_B cells | 0.955646 | -0.065451 | 0.020668 | 0.040691 | 1.390506 | p < 0.05 |
| median_Ly6G_Immature neutrophils | 0.937608 | -0.092943 | 0.001865 | 0.004895 | 2.310238 | p < 0.05 |
| median_Ly6G_Mature neutrophils | 0.911799 | -0.133212 | 0.000155 | 0.000576 | 3.239657 | p < 0.05 |
Next, we use the index of features that meet these criteria to replot the heatmap, this time including only the filtered features from our statistical analysis.
sns.clustermap(
stats[stats_filtered.index],
z_score = 1,
cmap = my_pal,
row_colors = row_colors,
figsize = (16, 8),
cbar_pos = (0.08, 0.12, 0.03, 0.2),
cbar_kws = {'label' : 'Scaled expression'}
)
<seaborn.matrix.ClusterGrid at 0x16b23372170>
For individual plots comparing the groups per statistic, we create boxplots. Because the loop opens one figure per statistic, we close each figure after rendering it; otherwise matplotlib warns about (and keeps in memory) dozens of open figures.
for stat in stats_filtered.index:
    sns.catplot(
        data = stats,
        x = 'group',
        y = stat,
        hue = 'group',
        kind = 'box'
    )
    plt.show()
    plt.close('all')  # free each figure once it has been displayed